Clustering suggestion for Chinese news web pages from multi-media sources
Authors
Abstract
Some news items on Chinese web portals are clearly assigned to the wrong categories. The likely reasons are that Chinese news is difficult to classify automatically and that the news appearing on a portal is retrieved from many media sources. In this study, we integrate a genetic algorithm with a multi-class support vector machine (SVM) classifier to construct a Chinese news classification method. We also find that some similar documents are scattered across different categories, most likely because the categories used by the original media sources differ from those of the portal. Such similar news items should be collected to form a new category. We therefore combine a genetic algorithm with the fuzzy c-means algorithm to propose a new approach that offers clustering suggestions for news web pages that come from multiple media sources and are scattered across different categories.
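The abstract does not say how the genetic algorithm and fuzzy c-means are coupled, so the following is only a minimal sketch of the plain fuzzy c-means step, with the genetic-algorithm part left out. The function name `fuzzy_c_means`, the fuzzifier `m=2.0`, the fixed iteration count, and the toy two-dimensional `docs` vectors (standing in for document feature vectors) are all assumptions made for illustration, not details from the paper.

```python
import random

def fuzzy_c_means(points, c, m=2.0, iters=100, seed=0):
    """Plain fuzzy c-means on a list of equal-length numeric vectors.

    Returns (centers, u), where u[i][k] is the degree to which
    point i belongs to cluster k (each row of u sums to 1).
    """
    rng = random.Random(seed)
    n, dim = len(points), len(points[0])

    # Random initial memberships, each row normalised to sum to 1.
    u = []
    for _ in range(n):
        row = [rng.random() + 1e-9 for _ in range(c)]
        total = sum(row)
        u.append([v / total for v in row])

    centers = []
    for _ in range(iters):
        # Centre update: mean of the points weighted by u[i][k] ** m.
        centers = []
        for k in range(c):
            w = [u[i][k] ** m for i in range(n)]
            total = sum(w)
            centers.append(
                [sum(w[i] * points[i][d] for i in range(n)) / total
                 for d in range(dim)]
            )

        # Membership update: u_ik = 1 / sum_j (d_ik / d_ij) ** (2 / (m - 1)).
        u = []
        for p in points:
            dist = [
                max(sum((a - b) ** 2 for a, b in zip(p, ctr)) ** 0.5, 1e-12)
                for ctr in centers
            ]
            u.append(
                [1.0 / sum((dist[k] / dist[j]) ** (2.0 / (m - 1.0))
                           for j in range(c))
                 for k in range(c)]
            )
    return centers, u

# Hypothetical toy data: two tight groups of "documents".
docs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
centers, memberships = fuzzy_c_means(docs, c=2)

# Hard label per document = cluster with the highest membership.
labels = [max(range(2), key=lambda k: row[k]) for row in memberships]
```

Because each document receives a graded membership in every cluster rather than a single label, a document whose memberships are spread across clusters can be flagged as a candidate for a new, merged category, which is the kind of suggestion the abstract describes.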
Similar resources
Distribution of news information through social bookmarking: an examination of shared stories in the Delicious Website
Introduction. This study examined the selection and sharing of news stories from Delicious, a popular social bookmarking site, in order to identify the most frequently consulted news information sources and news topics. Method. Targeting US-specific sources through initial computer screening of URLs, we employed content analysis to further analyse story topics and sources that were unclassified...
Semantically Enhanced Television News through Web and Video Integration
The Rich News system for semantically annotating television news broadcasts and augmenting them with additional web content is described. On-line news sources were mined for material reporting the same stories as those found in television broadcasts, and the text of these pages was semantically annotated using the KIM knowledge management platform. This resulted in more effective indexing than ...
A Near-duplicate Detection Algorithm to Facilitate Document Clustering
Web mining faces serious problems due to duplicate and near-duplicate web pages. Detecting near duplicates is very difficult in a large collection of data such as the Internet. The presence of these web pages contributes significantly to performance degradation when integrating data from heterogeneous sources. These pages either increase the index storage space or increase the serving costs. Detecting t...
Clustering for Web Information Hierarchy Mining
Benefiting from the growth of dynamic page generation techniques, the number and complexity of Web pages are increasing explosively. Web pages that are dynamically generated from the same templates therefore have similar structures and are usually assembled from a set of fundamental information clusters. These neighboring information clusters usually represent the similar semantics...
A Platform for Multilingual News Summarization
We have developed a multilingual version of Columbia Newsblaster as a testbed for multilingual multi-document summarization. The system collects, clusters, and summarizes news documents from sources all over the world daily. It crawls news sites in many different countries, written in different languages, extracts the news text from the HTML pages, uses a variety of methods to translate the doc...